Geostatistics project for CNAM
Geolocation-based user-generated data from social networks can be a gold mine for studying urban dynamics. For instance, Mukhina, Rakitin, and Visheratin (2017) used Instagram pictures to analyse differences between the places visited by locals and by tourists. Obtaining this kind of data is however quite difficult, thankfully for the protection of our privacy.
Fortunately, the data of location-review websites are public, and the most prominent of them, Google Places, offers an API to query the details of over 150 million user-created places around the world.
But even with this, there are strong limitations:
Only 5 000 API calls per month are free.
A maximum of 60 places can be retrieved per search.
A maximum of 20 results is returned per query. If there are more than 20 results, another query must be made to retrieve the next page.
Only a few fields are available in a place search by area.
User-written reviews are barely accessible.
An R package to query the Google Maps API is readily available on CRAN (Cooley 2018), but I have chosen to code my own function in order to better understand the URL construction and the JSON search responses of the API.
Compared to this package, I have added a CSV file export as well as automatic follow-up queries to retrieve the next 20 results when needed.
There is an interval of 10 seconds between queries to take into account the delay in the generation of the JSON file, as described in the documentation (Google 2020):
There is a short delay between when a next_page_token is issued, and when it will become valid. Requesting the next page before it is available will return an INVALID_REQUEST response.
In addition, I have restricted the fields saved to only those which are useful for this study.
getPlaces <- function(APIKey, latitude, longitude, radius, savedir = NULL) {
  require(jsonlite)
  require(plyr)
  require(dplyr)
  baseUrl <- "https://maps.googleapis.com/maps/api/place/nearbysearch/json?"
  location <- paste0("location=", latitude, ","
                     , longitude, "&", "radius=", radius)
  # First query, then follow next_page_token while the API provides one
  # (at most three pages of 20 results each)
  pages <- list(fromJSON(paste0(baseUrl, APIKey, "&", location)))
  while (!is.null(pages[[length(pages)]]$next_page_token)) {
    token <- pages[[length(pages)]]$next_page_token
    # Wait for the token to become valid before requesting the next page
    Sys.sleep(10)
    pages <- c(pages, list(fromJSON(paste0(baseUrl, APIKey, "&",
                                           "pagetoken=", token))))
  }
  # Keep only the fields useful for this study
  extractFields <- function(page) as.data.frame(cbind(
    "Id" = page$results$place_id,
    "Name" = page$results$name,
    "Lat" = page$results$geometry$location$lat,
    "Lng" = page$results$geometry$location$lng,
    "Price_level" = page$results$price_level,
    "Rating" = page$results$rating,
    "Nb_Rating" = page$results$user_ratings_total,
    "Type" = page$results$types
  ))
  results <- Reduce(rbind.fill, lapply(pages, extractFields))
  if (!is.null(savedir)) {
    filename <- paste0(savedir, "results-", latitude, "-",
                       longitude, "-", radius, ".csv")
    # Flatten the list-columns so the data frame can be written to CSV
    results2 <- results
    results2$Type <- sapply(results$Type, paste, collapse = " ")
    results2 <- results2 %>% replace(. == "NULL", NA)
    results2 <- data.frame(sapply(results2, unlist))
    write.csv(results2, file = filename, row.names = FALSE)
  }
  results
}
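As a side note, because baseUrl already ends with a question mark and the key is pasted in directly, the APIKey argument has to include the key= prefix. The snippet below, with a hypothetical placeholder key and no request actually sent, shows the URL assembled for a first query:

```r
# Illustration only: the URL getPlaces assembles for its first query,
# with a hypothetical placeholder key (no request is actually sent)
baseUrl <- "https://maps.googleapis.com/maps/api/place/nearbysearch/json?"
APIKey <- "key=YOUR_API_KEY"  # the "key=" prefix is part of the argument
location <- paste0("location=", 48.87, ",", 2.38, "&", "radius=", 52.5)
url <- paste0(baseUrl, APIKey, "&", location)
url
```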
The smallest official subdivision of Paris is the district (“quartier”). There are eighty districts, four in each of the twenty arrondissements.
The Shapefile with the administrative details of each district and their geospatial vector data is available on the open data website of the city of Paris (Paris 2017).
I then kept only the six arrondissements of the North-East of Paris: the 9th, 10th, 11th, 18th, 19th and 20th.
Paris_by_district <- st_read("quartier_paris/quartier_paris.shp")
Reading layer `quartier_paris' from data source `/Users/BeMayer/Google Drive/Projets dev/WordDensityMap/quartier_paris/quartier_paris.shp' using driver `ESRI Shapefile'
Simple feature collection with 80 features and 8 fields
geometry type: POLYGON
dimension: XY
bbox: xmin: 2.224078 ymin: 48.81558 xmax: 2.469761 ymax: 48.90216
CRS: 4326
plot(st_geometry(Paris_by_district))
North_East <- Paris_by_district %>% filter(c_ar %in% c(9,10,11,18,19,20))
tm_shape(North_East) +
tm_borders(col = "grey60", lwd = 0.5) +
tm_fill(col = "c_ar", style="cat", title = "Arrondissement",
palette = brewer.pal(6, "Set3")) +
tm_text(text = "l_qu", size = 0.6)
I tried to find the best solution to cover the maximal area using the minimum number of queries. What I naively thought at first sight to be a trivial problem is in fact a field of research in itself, “circle packing” (Wikipedia 2020a).
The optimal solution would probably be given by the algorithm developed by Das et al. (2006), but I could not find any implementation of it in R.
In the absence of a better solution, I have used a staggered grid, as pictured below.
Let’s consider the superposition between two adjacent circles.
The area of the square of side \(r\) is \(r^2\).
The area of the blue zone plus the grey zone is the quarter circle, \(\frac{\pi r^2}{4}\).
So the area of the light brown zone (and, by symmetry, of the blue zone) is \(r^2-\frac{\pi r^2}{4}\).
Thus the ratio between the grey area and the area of the square is equal to:
\(\frac{r^2-2(r^2-\frac{\pi r^2}{4})}{r^2}=\frac{\pi}{2}-1\approx 57\,\%\)
This ratio is the maximum superposition between API calls on the staggered grid.
A superposition of 57% is probably very far from optimal, but it can also be an advantage: in some very dense neighbourhoods of Paris, even a small area can contain more than 60 places, so some of them may not be retrieved. Querying the same area twice helps mitigate this limitation.
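The ratio can be checked numerically in R:

```r
# Numerical check of the overlap ratio derived above
r <- 1  # the ratio does not depend on the radius
ratio <- (r^2 - 2 * (r^2 - pi * r^2 / 4)) / r^2
round(ratio, 4)  # 0.5708, i.e. about 57%
```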
I have written the function getGrid to generate the grid according to the parameters north, east, south, west and radius_in_meters.
getGrid <- function(north, east, south, west, radius_in_meters) {
  # Convert the radius from meters to degrees (valid around Paris, ~49° N)
  radius_in_longitude <- radius_in_meters / 73000
  radius_in_latitude <- radius_in_meters / 111110
  grid <- data.frame(matrix(ncol = 3, nrow = 0))
  long <- 0
  lati <- 0
  index <- 0
  # Main grid: circle centres every 2 radii, starting at the south-west corner
  while (south + lati <= north) {
    while (west + long <= east) {
      grid <- rbind(grid, c(index, west + long, south + lati))
      index <- index + 1
      long <- long + 2 * radius_in_longitude
    }
    long <- 0
    lati <- lati + 2 * radius_in_latitude
  }
  # Staggered grid: same spacing, offset by one radius in both directions
  long <- radius_in_longitude
  lati <- radius_in_latitude
  while (south + lati < north) {
    while (west + long < east) {
      grid <- rbind(grid, c(index, west + long, south + lati))
      index <- index + 1
      long <- long + 2 * radius_in_longitude
    }
    long <- radius_in_longitude
    lati <- lati + 2 * radius_in_latitude
  }
  colnames(grid) <- c("index", "longitude", "latitude")
  grid
}
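To make the spacing rule concrete, here is a minimal self-contained sketch of the main (non-staggered) loop on a hypothetical 1° × 1° box with a 0.25° radius, working directly in degrees; the same logic, shifted by one radius, produces the staggered layer:

```r
# Hypothetical toy example: the main (non-staggered) loop of getGrid,
# on a 1 x 1 degree box with a 0.25 degree radius, in plain degrees
north <- 1; east <- 1; south <- 0; west <- 0; r <- 0.25
pts <- data.frame(longitude = numeric(0), latitude = numeric(0))
lati <- 0
while (south + lati <= north) {
  long <- 0
  while (west + long <= east) {
    pts <- rbind(pts, data.frame(longitude = west + long,
                                 latitude = south + lati))
    long <- long + 2 * r  # next circle centre, 2 radii to the east
  }
  lati <- lati + 2 * r    # next row of circles, 2 radii to the north
}
nrow(pts)  # 3 columns (0, 0.5, 1) x 3 rows = 9 points
```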
A conversion between degrees and meters is needed.
A degree of latitude is always about 111 kilometers, but the length of a degree of longitude depends on the latitude of the region considered.
The length of 1° of longitude in kilometers is given by the formula below:
\(1°\,\text{longitude} = 111\,\text{km} \times \cos(\text{latitude})\)
Since Paris is at about 49° of latitude,
\(1°\,\text{longitude} = 111\,\text{km} \times \cos(49°) \approx 73\,\text{km}\)
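These conversion factors, which appear as the constants 111110 and 73000 (meters per degree) in getGrid, can be verified directly:

```r
# One degree of longitude shrinks with the cosine of the latitude;
# cos() in R expects radians, hence the conversion from 49 degrees
km_per_degree_longitude <- 111 * cos(49 * pi / 180)
round(km_per_degree_longitude, 1)  # about 72.8 km, i.e. roughly 73 km
```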
To find the parameters north, east, south and west for the getGrid function, we use the bounding box of the Paris_by_district object, given by the function st_bbox. These are the limits of the city of Paris to the North, East, South and West.
st_bbox(Paris_by_district)
xmin ymin xmax ymax
2.224078 48.815577 2.469761 48.902162
By trial and error, I found that the optimal radius_in_meters to cover the North-East of Paris in a maximum of 5 000 Google Maps API calls is 52.5 meters.
The grid is generated for a whole rectangle around Paris, converted to an sf object and then the points are filtered to only keep those in the North East of Paris.
Grid_Paris <- getGrid(48.91, 2.47, 48.81, 2.22, 52.5) %>%
st_as_sf(coords = c("longitude", "latitude"), crs = 4326) %>%
st_filter(st_union(North_East$geometry))
nrow(Grid_Paris)
[1] 4964
Exactly 4 964 grid points are generated, just below the limit of 5 000 free API calls.
To plot the grid, the objects should be projected to a planar projection instead of using geodetic coordinates (longitude and latitude in degrees).
The official projection in France is Lambert 93 (Wikipedia 2020b); its parameters are available on the spatialreference.org website (reference 2020).
Lambert <- crs("+proj=lcc +lat_1=49
+lat_2=44 +lat_0=46.5 +lon_0=3 +x_0=700000
+y_0=6600000 +ellps=GRS80
+towgs84=0,0,0,0,0,0,0 +units=m +no_defs")
North_East_Lambert <- North_East %>%
st_transform(crs = Lambert) %>%
st_geometry()
Grid_Paris_Lambert <- Grid_Paris %>%
st_transform(crs = Lambert) %>%
st_buffer(dist = 52.5) %>%
st_geometry()
plot(North_East_Lambert, col = sf.colors(12, categorical = TRUE),
border = 'grey')
plot(Grid_Paris_Lambert, col = adjustcolor("skyblue", alpha.f=0.1),
border = 1, add = TRUE)
Using the leaflet package, it is also easy to create an interactive map of the previously generated grid. Unlike the plot function, leaflet requires geodetic coordinates to work properly.
Grid_Paris_Leaflet <- Grid_Paris_Lambert %>%
st_transform(crs = "+proj=longlat +datum=WGS84")
leaflet(Grid_Paris_Leaflet) %>%
addTiles() %>%
addPolygons(color = "#444444", weight = 1, smoothFactor = 0.5,
opacity = 1.0, fillOpacity = 0.5)